Abstract

Background: We study the sparsification of dynamic programming based folding algorithms for RNA structures.
Sparsification is a method that significantly improves the computation of minimum free energy (mfe) RNA structures.

Results: We provide a quantitative analysis of the sparsification of a particular decomposition rule, $\Lambda^*$. This rule splits an interval of RNA secondary and pseudoknot structures of fixed topological genus. The key to quantifying sparsification is the size of the so-called candidate sets. Here we assume mfe-structures to be specifically distributed (see Assumption 1) within arbitrary and irreducible RNA secondary and pseudoknot structures of fixed topological genus. We then present a combinatorial framework which, by means of probabilities of irreducible sub-structures, allows us to obtain the expectation of the $\Lambda^*$-candidate set w.r.t. a uniformly random input sequence. We compute these expectations for arc-based energy models via energy-filtered generating functions (GF) in the case of RNA secondary structures as well as RNA pseudoknot structures. Furthermore, for RNA secondary structures we also analyze a simplified loop-based energy model. Our combinatorial analysis is then compared to the expected number of $\Lambda^*$-candidates obtained from folding mfe-structures. In the case of the mfe-folding of RNA secondary structures with a simplified loop-based energy model, our results imply that sparsification provides a significant, constant improvement of 91% (theory), to be compared to a 96% (experimental, simplified arc-based model) reduction. However, we do not observe a linear factor improvement. Finally, in the case of the "full" loop-energy model we can report a reduction of 98% (experiment).

Conclusions: Sparsification was initially credited with a linear factor improvement. This conclusion was based on the so-called polymer-zeta property, which stems from interpreting polymer chains as self-avoiding walks. Subsequent findings, however, reveal that the $O(n)$ improvement is not correct. The combinatorial analysis presented here shows that, assuming a specific distribution (see Assumption 1) of mfe-structures within irreducible and arbitrary structures, the expected number of $\Lambda^*$-candidates is $\Theta(n^2)$. However, the constant reduction is quite significant, being in the range of 96%. We furthermore show an analogous result for the sparsification of the $\Lambda^*$-decomposition rule for RNA pseudoknotted structures of genus one. Finally, we observe that the effect of sparsification is sensitive to the employed energy model.

Keywords: Sparsification, Generating function, Dynamic programming



Background 

RNA structures, diagrams and genus filtration 

An RNA sequence is a linear, oriented sequence of the nucleotides (bases) A, U, G, C. These sequences "fold" by establishing bonds between pairs of nucleotides. In this paper, we only consider the Watson-Crick base pairs A-U and G-C and the wobble base pair U-G. The global conformation of an RNA molecule is determined by topological constraints encoded at the level of secondary structure, i.e., by the mutual arrangements of the base pairs [1].

Secondary structures can be interpreted as (partial) matchings in a graph of permissible base pairs [2]. They can be represented as diagrams, i.e. graphs over the vertices $1, \dots, n$, drawn on a horizontal line with bonds (arcs) in the upper half-plane. The length of an arc $(i,j)$ is given by $j - i$. Furthermore, we call two arcs $(i,j)$ and $(r,s)$ (suppose $i < r$) crossing if $i < r < j < s$ holds. In this representation one refers to a structure without crossing arcs as a simple secondary structure and to a structure with crossing arcs as a pseudoknot structure, see Figure 1.

Figure 1 RNA structures as planar graphs and diagrams. (A) an RNA secondary structure and (B) an RNA pseudoknot structure.



A diagram is a labeled graph over the vertex set $[n] = \{1, \dots, n\}$ in which each vertex has degree $\le 3$, represented by drawing its vertices in a horizontal line. The backbone of a diagram is the sequence of consecutive integers $(1, \dots, n)$ together with the edges $\{\{i, i+1\} \mid 1 \le i \le n-1\}$. The arcs of a diagram, $(i,j)$, where $i < j$, are drawn in the upper half-plane. We shall distinguish the backbone edge $\{i, i+1\}$ from the arc $(i, i+1)$, which we refer to as a 1-arc. A stack of length $\ell$ is a maximal sequence of "parallel" arcs, $((i,j), (i+1,j-1), \dots, (i+(\ell-1), j-(\ell-1)))$, and is also referred to as an $\ell$-stack, see Figure 2.

We shall consider diagrams as fatgraphs, $G$, that is graphs $G$ together with a collection of cyclic orderings, called fattenings. Each fatgraph $G$ determines an oriented surface $F(G)$ [3,4] which is connected if $G$ is and has some associated genus $g(G) \ge 0$ and number $r(G) \ge 1$ of boundary components. Clearly, $F(G)$ contains $G$ as a deformation retract [5]. Fatgraphs were first applied to RNA secondary structures in [6,7].

A diagram $G$ hence determines a unique surface $F(G)$ (with boundary). Filling the boundary components with discs we can pass from $F(G)$ to a surface without boundary. The Euler characteristic, $\chi$, and genus, $g$, of this surface are given by $\chi = v - e + r$ and $g = 1 - \frac{1}{2}\chi$, respectively, where $v$, $e$, $r$ denote the number of discs, ribbons and boundary components in $G$ [5]. The genus of a diagram is that of its associated surface without boundary and a diagram of genus $g$ is referred to as a $g$-diagram.
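The genus formula can be made concrete with a small computation. The following is a minimal sketch, not the authors' implementation: it assumes the backbone is collapsed into a single disc (so $v = 1$ and $e$ equals the number of arcs, which leaves $\chi$ unchanged) and counts boundary components as cycles of the map "jump along the arc, then step to the next endpoint". The function name `genus` and the arc-list input format are illustrative choices.

```python
def genus(arcs):
    """Genus of a diagram given as a list of arcs (i, j) with i < j.

    Unpaired vertices do not affect the genus, so only arc endpoints are kept.
    Assumption: the backbone is collapsed to one disc (v = 1, e = number of arcs).
    """
    ends = sorted(p for arc in arcs for p in arc)
    if len(set(ends)) != len(ends):
        raise ValueError("each vertex may be incident to at most one arc")
    k = len(arcs)
    if k == 0:
        return 0
    pos = {p: idx for idx, p in enumerate(ends)}  # relabel endpoints 0..2k-1
    partner = {}
    for i, j in arcs:
        partner[pos[i]] = pos[j]
        partner[pos[j]] = pos[i]
    # Boundary components are the cycles of x -> partner(x) + 1 (mod 2k).
    seen, r = set(), 0
    for start in range(2 * k):
        if start in seen:
            continue
        r += 1
        x = start
        while x not in seen:
            seen.add(x)
            x = (partner[x] + 1) % (2 * k)
    chi = 1 - k + r           # chi = v - e + r
    return (2 - chi) // 2     # g = 1 - chi / 2


if __name__ == "__main__":
    print(genus([(1, 10), (2, 9)]))    # two nested arcs
    print(genus([(1, 21), (11, 33)]))  # the two crossing arcs named in Figure 2
```

Nested or parallel arcs leave the genus at zero, while a single pair of crossing arcs already yields genus one, which is the distinction between secondary and pseudoknot structures used throughout.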

A $g$-diagram without arcs of the form $(i, i+1)$ (1-arcs) is called a $g$-structure. A $g$-diagram that contains only vertices of degree three, i.e. does not contain any vertices not incident to arcs in the upper half-plane, is called a $g$-matching. A diagram is called irreducible if and only if it cannot be split into two by cutting the backbone without cutting an arc, see Figure 2.

Folding algorithms 

Folded configurations are energetically somewhat optimal. Here energy is obtained by adding contributions of loops [8] contained in RNA secondary and pseudoknot structures. Any RNA structure has a unique decomposition into disjoint loops, which stems from the fatgraph [9,10] interpretation of such structures in which loops correspond to boundary components [11]. Additional constraints imply further properties, for instance certain minimum arc-length conditions [12] and the nonexistence of isolated bonds. An mfe-RNA structure can be predicted in polynomial time by means of dynamic programming (DP) routines [12,13].

The most commonly used tools for predicting simple RNA secondary structures, mfold [13] and the Vienna RNA Package [14], require $O(n^2)$ space and $O(n^3)$ time. In the following we omit "simple" and refer to secondary structures containing crossing arcs as pseudoknot structures.

Figure 2 Diagram representation and irreducibility. A diagram over $\{1, \dots, 55\}$. The arcs $(1,21)$ and $(11,33)$ are crossing and the dashed arc $(9,10)$ is a 1-arc, which is not allowed. This structure contains 4 stacks of length 7, 4, 6 and 4, from left to right respectively. Irreducibility is also relative to a decomposition rule. For the rule $\Lambda^*$, splitting $S_{i,j}$ into $S_{i,k}$ and $S_{k+1,j}$, $S_{1,55}$ is not $\Lambda^*$-irreducible, while $S_{2,40}$ and $S_{43,55}$ are. However, for a specific decomposition rule $\Lambda$, which removes the outermost arc, $S_{43,55}$ is not $\Lambda$-irreducible while $S_{2,40}$ is.

Generalizing the matrices of the DP-routines of secondary structure folding [13,14] to gap-matrices [15] leads to a DP-folding of pseudoknotted structures [15] (pknot-R&E) with $O(n^4)$ space and $O(n^6)$ time complexity. The following references provide a certainly incomplete list of DP-approaches to RNA pseudoknot structure prediction using various structure classes characterized in terms of recursion equations and/or stochastic grammars: [9,15-26]. The most efficient algorithm for pseudoknot structures is [22] (pknotsRG), having $O(n^2)$ space and $O(n^4)$ time complexity. This algorithm, however, considers only a restricted class of pseudoknots.

Note that RNA secondary structures are exactly struc- 
tures of topological genus zero [27]. The topological clas- 
sification of RNA structures [10,11,28] has recently been 
translated into an efficient DP-algorithm [9]. Fixing the 
topological genus of RNA structures implies that there 
are only finitely many types, the so called irreducible 
shadows [11]. 

Sparsification 

Let us have a closer look at sparsification and the results of [29-31]. Sparsification is a method tailored to speed up DP-algorithms predicting mfe-secondary structures [29,31]. The idea is to prune certain computation paths encountered in the DP-recursions, see Figure 3A. Let us consider the case of RNA secondary structure folding. Here sparsification reduces the DP-recursion paths to those based on so-called candidates. A candidate is in this case an interval for which the optimal solution cannot be written as a sum of optimal solutions of sub-intervals. This implies that the structure over a candidate is an "irreducible" structure when tracing back from the optimal solution. Considering only these candidates gives the same optimal solution as considering all possible intervals. The crucial observation here is that if these irreducibles appear only at a low rate we have a significant reduction in time and space complexity.

Sparsification has also been applied in the context of RNA-RNA interaction structures [30] as well as RNA pseudoknot structures [32]. In contrast to RNA secondary structures, however, not every decomposition rule in the DP-folding of RNA pseudoknot structures is amenable to sparsification. By construction, sparsification can only be applied for calculating mfe-structures. Since the computation of the partition function [20,33] needs to take into account all sub-structures, sparsification does not work there.

Sparsification [29,31,32] can be described as follows: let $V = \{v_1, v_2, \dots\}$ be a set whose elements $v_i$ are unions of pairwise disjoint intervals. Let furthermore $L_v$ denote an optimal solution (here optimal means maximizing the score) of the DP-routine over $v$. By assumption $L_v$ is recursively obtained. Suppose we are given a decomposition rule $\Lambda_1$, for which the optimal solution $L_v$ is $L_v = L_{v_1} + L_{v_2} + L_{v_3}$, where $v = v_1 \cup v_2 \cup v_3$ is a disjoint union. Then, under certain circumstances, the DP-routine may interpret $L_v$ either as $(L_{v_1} + L_{v_2}) + L_{v_3}$ or as $L_{v_1} + (L_{v_2} + L_{v_3})$, see Figure 3B. To be precise, this situation is encountered iff

• there exists an optimal solution $L_{v'_1}$ for a sub-structure over $v'_1$, where $v'_1 = v_1 \cup v_2$ via $\Lambda_2$, and $L_v$ is obtained from $L_{v'_1}$ and $L_{v_3}$ via $\Lambda_1$,

• there exists an optimal solution $L_{v'_2}$ for a sub-structure over $v'_2$, where $v'_2 = v_2 \cup v_3$ via $\Lambda_3$, and $L_v$ is obtained from $L_{v_1}$ and $L_{v'_2}$ via $\Lambda_1$.

Figure 3 (A) Sparsification of secondary structure folding. Suppose the optimal solution $L_{i,j}$ is obtained from the optimal solutions $L_{i,k}$, $L_{k+1,q}$ and $L_{q+1,j}$. Based on the recursions of the secondary structures, $L_{i,k}$ and $L_{k+1,q}$ produce an optimal solution of $L_{i,q}$. Similarly, $L_{k+1,q}$ and $L_{q+1,j}$ produce an optimal solution of $L_{k+1,j}$. Now, in order to obtain an optimal solution of $L_{i,j}$ it is sufficient to consider either the grouping $L_{i,q}$ and $L_{q+1,j}$ or $L_{i,k}$ and $L_{k+1,j}$. (B) General idea of sparsification: $L_v$ is alternatively realized via $L_{v_1}$ and $L_{v'_2}$, or $L_{v'_1}$ and $L_{v_3}$. Thus it is sufficient to only consider one of the computation paths.

Given a decomposition

$$L_v = \underbrace{\underbrace{L_{v_1} + L_{v_2}}_{\Lambda_2} + L_{v_3}}_{\Lambda_1},$$

we call $\Lambda_2$ $s$-compatible to $\Lambda_1$ if there exists a decomposition rule $\Lambda_3$ such that

$$L_v = \underbrace{L_{v_1} + \underbrace{L_{v_2} + L_{v_3}}_{\Lambda_3}}_{\Lambda_1}.$$

Note that if $\Lambda_2$ is $s$-compatible to $\Lambda_1$ then $\Lambda_3$ is $s$-compatible to $\Lambda_1$. To summarize:

Definition 1. ($s$-compatible) Suppose $L_v$ is the optimal solution over $v$, $L_v = L_{v'_1} + L_{v_3}$ under decomposition rule $\Lambda_1$, and $L_{v'_1}$ is obtained from two optimal solutions $L_{v_1}$ and $L_{v_2}$ under rule $\Lambda_2$. Then $\Lambda_2$ is called $s$-compatible to $\Lambda_1$ if there exists some rule $\Lambda_3$ such that $L_{v'_2} = L_{v_2} + L_{v_3}$ and $L_v = L_{v_1} + L_{v'_2}$.

Figure 3B depicts two such ways that realize the same optimal solution $L_v$. Sparsification prunes any such multiple computations of the same optimal value. Note that by symmetry, $\Lambda_2$ and $\Lambda_3$ are both $s$-compatible to $\Lambda_1$.

We next come to the important concept of candidates. 
The latter mark the essential computation paths for the 
DP-routine. 

Definition 2. (Candidates) Suppose $L_v$ is an optimal solution in the sense of maximizing. We call $v$ a $\Lambda$-candidate if for any $v_1 \subset v$ obtained by $\Lambda$ and $v = v_1 \cup v_2$ (disjoint union), we have

$$L_v > L_{v_1} + L_{v_2},$$

and we shall denote the set of $\Lambda$-candidates by $Q_\Lambda$.

By construction a $\Lambda$-candidate $v$ is a union of disjoint intervals such that its optimal solution $L_v$ cannot be obtained via a $\Lambda$-splitting. This optimal solution allows one to construct a (non-unique) arc-configuration (sub-structure) over $v$ [13,14] and the above $\Lambda$-splitting consequently translates into a splitting of this sub-structure. This connects the notion of $\Lambda$-candidates with that of sub-structures and shows that a $\Lambda$-candidate implies a sub-structure that is $\Lambda$-irreducible.

Lemma 1. [29,32] Suppose $L_v$ is obtained by selecting the optimal solution from the decomposition rules $\Lambda_1, \Lambda_2, \dots, \Lambda_n$. If $\Lambda$ is $s$-compatible to all $\Lambda_i$, $1 \le i \le n$, then $L_v$ can be obtained via $\Lambda$-candidates.

In summary, as for the impact of sparsification, [29] claims that sparsification reduces the time complexity by a linear factor. This claim is based on the assumption that RNA molecules satisfy the polymer-zeta property [29]. Subsequent studies draw a slightly different picture [31], concluding that sparsification requires $O(nZ)$ time, where $n$ denotes the length of the input sequence and $Z$ is a sparsity parameter satisfying $n \le Z \le n^2$. Recently, it has been shown in [34] that the asymptotic time complexity of a sparsified RNA folding algorithm using standard energy parameters remains $O(n^3)$ under a wide variety of conditions.

Sparsification of RNA secondary structures 

Here we recall some results of [29,31] on the sparsification of RNA secondary structures. Secondary structures satisfy a simple recursion which gives the optimal (maximum) solution over $[i,j]$ by $L_{i,j} = \max\{V_{i,j}, W_{i,j}\}$, where $V_{i,j}$ denotes the optimal solution in which $(i,j)$ is a base pair, and $W_{i,j}$ denotes the optimal solution obtained by adding the optimal solutions of two subsequent intervals. Note that the optimal solution over a single vertex is denoted by $L_{i,i}$. We have the recursion equations for $V_{i,j}$ and $W_{i,j}$:

$$(\Lambda_1) \quad V_{i,j} = L_{i+1,j-1} + w(i,j),$$

$$(\Lambda^*) \quad W_{i,j} = \max_{i \le k < j} \{ L_{i,k} + L_{k+1,j} \},$$

where $w(i,j)$ is the energy contribution of $(i,j)$ forming a base pair, see Figure 4. In case two positions $i$, $j$ in the sequence are incompatible we have $w(i,j) = -\infty$.

An interval $[i,j]$ is a $\Lambda^*$-candidate if the optimal solution over $[i,j]$ is given by $L_{i,j} = V_{i,j} > W_{i,j}$. Indeed, $[i,j]$ is a candidate iff $[i,j]$ is in the candidate set of $\Lambda^*$, and we denote this set by $Q$. Suppose the optimal solution $W_{i,j}$ is given by $W_{i,j} = L_{i,q} + L_{q+1,j}$ and suppose we have $L_{i,q} = L_{i,k} + L_{k+1,q}$. Then $[i,q]$ is not a candidate, and Lemma 1 shows that we can compute $W_{i,j} = L_{i,k} + L_{k+1,j}$, where $[i,k]$ is a candidate.
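The candidate idea can be illustrated on a base-pair maximization toy model. The following is a minimal sketch under stated assumptions, not the algorithm of [29,31] verbatim: every allowed arc scores 1, arcs need more than `min_gap` positions between their ends, intervals are half-open (`seq[i:j]`), and the case "position i unpaired" is handled by one extra term next to the candidate list. All function names are illustrative.

```python
import random

PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}
NEG = float("-inf")


def arc_score(seq, i, j, min_gap):
    """Score w(i, j) of the arc (i, j): 1 if allowed, -inf otherwise."""
    if j - i > min_gap and (seq[i], seq[j]) in PAIRS:
        return 1
    return NEG


def fold_full(seq, min_gap=3):
    """Plain DP: L[i][j] is the optimum over seq[i:j], trying every split point."""
    n = len(seq)
    L = [[0] * (n + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(i + 2, n + 1):
            v = arc_score(seq, i, j - 1, min_gap) + L[i + 1][j - 1]
            w = max(L[i][k] + L[k][j] for k in range(i + 1, j))
            L[i][j] = max(v, w)
    return L[0][n]


def fold_sparse(seq, min_gap=3):
    """Sparsified DP: splits are only tried at candidates seq[i:k] with V > W."""
    n = len(seq)
    L = [[0] * (n + 1) for _ in range(n + 1)]
    n_candidates = 0
    for i in range(n - 1, -1, -1):
        cands = []  # pairs (k, V) such that seq[i:k] is a candidate
        for j in range(i + 2, n + 1):
            v = arc_score(seq, i, j - 1, min_gap) + L[i + 1][j - 1]
            w = L[i + 1][j]  # position i stays unpaired
            for k, v_ik in cands:
                w = max(w, v_ik + L[k][j])
            if v > w:  # seq[i:j] cannot be split optimally: it is a candidate
                cands.append((j, v))
                n_candidates += 1
            L[i][j] = max(v, w)
    return L[0][n], n_candidates


if __name__ == "__main__":
    random.seed(1)
    s = "".join(random.choice("ACGU") for _ in range(60))
    print(fold_full(s), fold_sparse(s))
```

Both routines return the same optimum; the sparsified one replaces the inner loop over all split points by a loop over the candidates that start at `i`, so its running time is governed by the size of the candidate set.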

Sparsification on RNA pseudoknot structures 

Sparsification can also be applied to the DP-algorithm folding RNA structures with pseudoknots [32]. In contrast to the decomposition rule $\Lambda^*$ that splits an interval into two subsequent intervals, we encounter in the grammar for pseudoknotted structures additional, more complex decomposition rules [15]. As shown in [32] there exist some decomposition rules which are not $s$-compatible and which can accordingly not be sparsified at all, see Figure 5B. For instance, given a decomposition rule $\Lambda$ in pknot-R&E, subsequent decomposition rules which are $s$-compatible to $\Lambda$ are referred to as split type of $\Lambda$ [32].

In the following we will study RNA pseudoknot structures of fixed topological genus, see Section RNA structures, diagrams and genus filtration for details. An algorithm folding such pseudoknot structures, gfold, has been presented in [9]. The decomposition rules that appear in gfold are reminiscent of those of pknot-R&E but as they restrict the genus of sub-structures, the iteration of gap-matrices is severely restricted and the effect of sparsification of these decompositions is significantly smaller.

In the following, we restrict our analysis of pseudoknotted structures to the decomposition rule $\Lambda^*$, which splits an interval into two subsequent intervals. Put differently, $\Lambda^*$ cuts the backbone of an RNA pseudoknot structure of fixed genus $g$ over one interval without cutting a bond.

Efficiency of sparsification 

By construction, the fewer candidates the DP-routine encounters, the more efficient the sparsification. Thus it is of utmost importance to analyze the number of candidates. In the case of sparsification of RNA secondary structures we have one basic decomposition rule $\Lambda^*$ acting on intervals, namely $\Lambda^*$ splits an interval into two disjoint, subsequent intervals. The implied notion of a $\Lambda^*$-irreducible sub-structure is that of a sub-structure nested in a maximal arc, where maximal refers to the partial order of two arcs $(i,j) \le (i',j')$ iff $i' \le i \wedge j \le j'$. This observation relates irreducibility to nesting of arcs and, following this line of thought, [29] identifies a specific property of polymer-chains introduced in [35,36] to be of relevance for the size of candidate sets:

Definition 3. (Polymer-zeta property) Let $P(i,j)$ denote the probability of a structure over an interval $[i,j]$ under some decomposition rule $\Lambda$. Then we say $\Lambda$ follows the polymer-zeta property if $P(i,j) = b\,m^{-c}$ for some constants $b, c > 0$ and $m = j - i$.

Polymer-zeta comes from modeling the 2D-folding of a polymer chain as a self-avoiding walk (SAW) in a 2D lattice [37]. It implies that the probability of a base pair $(i,j)$ depends only on the length of the arc, i.e. $P(i,j) = P(m)$, where $m = j - i$. The authors of [29] stipulate that RNA molecules satisfy the polymer-zeta property and approximate $P(i,j)$ by $P(m) = b\,m^{-c}$ using 50,000 mRNA sequences of an average length of 1992 nucleotides [38]. They find $b \approx 2.11$ and $c \approx 1.47$. The average probability $P(m)$ is displayed in Figure 4, page 865 of [29] for increasing $m$. Furthermore, it is implied via Figure 6, page 867 of [29] that the average number of candidates converges to a constant, implying that the sparsified DP-routine folding secondary structures has $O(n^2)$ time complexity.
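To see why the polymer-zeta property would suggest a linear-factor gain, one can sum $P(m)$ over all intervals. The sketch below is an illustration only: it assumes that an interval of length $m$ is a candidate with probability $P(m) = b\,m^{-c}$, clamps this value to 1 for small $m$ (a choice made here, not taken from [29]), and uses the fitted constants quoted above as defaults.

```python
def expected_candidates_polymer_zeta(n, b=2.11, c=1.47):
    """Sum of (n - m + 1) * min(1, b * m**(-c)) over interval lengths m = 1..n."""
    total = 0.0
    for m in range(1, n + 1):
        p = min(1.0, b * m ** (-c))  # clamp: b * m**(-c) exceeds 1 for m = 1
        total += (n - m + 1) * p
    return total


if __name__ == "__main__":
    for n in (100, 1000, 10000):
        e = expected_candidates_polymer_zeta(n)
        print(n, e / n, e / (n * (n + 1) / 2))
```

Because $c > 1$, the series $\sum_m m^{-c}$ converges, so under this assumption the expected number of candidates per starting position stays bounded and the fraction of candidate intervals among all $n(n+1)/2$ intervals tends to zero. This is the behavior that the remainder of the paper argues does not hold for mfe-folding.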
Figure 5 Decomposition rules for pseudoknot structures of fixed genus (decomposed into three colors). (A) three decompositions via the rule $\Lambda^*$, which is $s$-compatible to itself. (B) three decomposition rules $\Lambda_1, \Lambda_2, \Lambda_3$ where $\Lambda_2, \Lambda_3$ are $s$-compatible to $\Lambda_1$. (C) three decomposition rules $\Lambda_1, \Lambda_2, \Lambda_3$ where $\Lambda_2, \Lambda_3$ are not $s$-compatible to $\Lambda_1$.



These findings have been questioned by [34], where it has been observed that the time complexity of a sparsified RNA folding algorithm based on energy minimization remains $O(n^3)$ independently of the energy function used and the base composition of the RNA sequence. [34] argues that the significant effect of sparsification on the DP-routine is largely a finite-size effect: when the sequence length is below some threshold, the algorithm is dominated by the quadratic time factor. In this context, it may be worth pointing out that [31] noticed that the improvement of a sparsified base-pairing maximization algorithm depends heavily on the base composition of the input. Backofen parameterizes explicitly the cardinality of candidate sets in [31].



Contribution 

In this paper we study the sparsification of the decomposition rule $\Lambda^*$ [31,32] for RNA secondary and RNA pseudoknot structures of fixed topological genus. Based on Assumption 1 below, our paper provides a combinatorial framework for quantifying the effects of sparsification of the $\Lambda^*$ rule.

We shall prove that the candidate set [29,31,32] is indeed 
small. We compute the probability of an interval being 
a candidate for two different energy models. For both 
models, this is facilitated via computing the generating 
function (GF) of structures and the generating function 
of irreducible structures. By studying the asymptotics of 
coefficients in these generating functions, we can compute 
the expected number of candidates of a uniformly ran- 
dom input sequence for large n. We show similar results 
for RNA pseudoknot structures of fixed topological genus. 
This provides new insights into the improvements of the sparsification of the concatenation-rule $\Lambda^*$ in the presence of cross serial interactions. Our observations complement the detailed analysis of Backofen [31,32]. We
show that although for pseudoknot structures of fixed 
topological genus [10,11] the effect of sparsification on the 
global time complexity is still unclear, the decomposition 
rule that splits an interval can be sped up significantly. 



Methods 

Suppose $w$ is an energy function for RNA structures. Let $w_S(\sigma)$ denote the energy of an RNA structure $\sigma$ over a sequence $S$. The partition function of $S$ is given by

$$Q_S = \sum_{\sigma} e^{w_S(\sigma)/RT},$$

where $R$ is the universal gas constant and $T$ is the temperature. (Here we consider $w_S(\sigma)$ as a positive score.) The partition function induces a probability space in which the probability of a structure $\sigma$ is

$$P_S(\sigma) = \frac{e^{w_S(\sigma)/RT}}{Q_S}.$$
The concept of a partition function is close to that of a generating function. In case of $e^{w_S(\sigma)/RT} = 1$, i.e., each structure contributes equally regardless of the underlying sequence, the partition function equals $[z^n]\mathbf{G}(z)$, where $\mathbf{G}$ is the generating function and $[z^n]\mathbf{G}$ is the coefficient of the term $z^n$.

Two important energy models are arc-based [39] and loop-based [8], respectively. The loop-based energy-filtration is different from the notion of "stickiness" [40]. The probability that two positions of a random sequence are compatible is taken to be 6/16, i.e. the probability of two given positions being compatible under the Watson-Crick and wobble base pair rules.

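Assumption 1. Let

$$W(\sigma) = \left(\tfrac{6}{16}\right)^{\ell} \eta^{w(\sigma)},$$

where $\eta > 1$ is a constant, $w(\sigma)$ is the energy value assigned to $\sigma$ based on a given energy model and $\ell$ is the number of arcs contained in $\sigma$. Then the probability of a particular structure $\sigma$ to be the mfe-structure of a uniformly random input sequence is

$$P(\sigma) = \frac{W(\sigma)}{\sum_{\sigma'} W(\sigma')}. \tag{1}$$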
Asymptotics

In this section we compute two generating functions and their singular expansions [11]. Let $c_g(n)$ and $d_g(n)$ denote the number of $g$-matchings and $g$-structures having $n$ arcs and $n$ vertices, respectively, with GF

$$\mathbf{C}_g(z) = \sum_{n=0}^{\infty} c_g(n) z^n, \qquad \mathbf{D}_g(z) = \sum_{n=0}^{\infty} d_g(n) z^n.$$

The GF $\mathbf{C}_g(z)$ has been computed in the context of the virtual Euler characteristic of the moduli-space of curves in [41] and $\mathbf{D}_g(z)$ can be derived from $\mathbf{C}_g(z)$ by means of symbolic enumeration [11]. The GF of genus zero diagrams $\mathbf{C}_0(z)$ is well-known to be the GF of the Catalan numbers, i.e., the numbers of triangulations of a polygon with $(n+2)$ sides,

$$\mathbf{C}_0(z) = \frac{1 - \sqrt{1 - 4z}}{2z}.$$



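As for $g \ge 1$ we have the following situation [11]:

Theorem 1. Suppose $g \ge 1$. Then the following assertions hold:

(a) $\mathbf{D}_g(z)$ is algebraic and

$$\mathbf{D}_g(z) = \frac{1}{z^2 - z + 1}\, \mathbf{C}_g\!\left(\frac{z^2}{(z^2 - z + 1)^2}\right). \tag{2}$$

In particular, the real solution of minimal modulus of $z^2/(z^2 - z + 1)^2 = 1/4$ is the only dominant singularity of $\mathbf{D}_g(z)$. We have for some constant $a_g$ depending only on $g$ and $\gamma \approx 2.618$:

$$[z^n]\mathbf{D}_g(z) \sim a_g\, n^{3(g - \frac{1}{2})}\, \gamma^n. \tag{3}$$

(b) The bivariate GF of $g$-structures over $n$ vertices, containing exactly $m$ arcs, $\mathbf{E}_g(z,t)$, is given by

$$\mathbf{E}_g(z,t) = \frac{1}{t z^2 - z + 1}\, \mathbf{C}_g\!\left(\frac{t z^2}{(t z^2 - z + 1)^2}\right). \tag{4}$$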
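The growth rate $\gamma \approx 2.618$ can be checked numerically in the genus zero case. The sketch below is an illustration under stated assumptions: it solves $z^2/(z^2-z+1)^2 = 1/4$ on the branch $2z = z^2 - z + 1$, and it counts secondary structures without 1-arcs by the elementary recursion "last vertex unpaired, or paired with some vertex at distance at least two". The recursion and the function name are choices made here and are not taken from [11].

```python
from math import sqrt


def count_secondary_structures(n_max):
    """s[n] = number of secondary structures over n vertices without 1-arcs."""
    s = [1] * (n_max + 1)  # s[0] = s[1] = 1
    for n in range(2, n_max + 1):
        # vertex n unpaired, or paired with k (1 <= k <= n - 2)
        s[n] = s[n - 1] + sum(s[k - 1] * s[n - 1 - k] for k in range(1, n - 1))
    return s


# z^2 / (z^2 - z + 1)^2 = 1/4 reduces to z^2 - 3z + 1 = 0 on the positive branch.
rho = (3 - sqrt(5)) / 2   # root of minimal modulus
gamma = 1 / rho           # exponential growth rate

if __name__ == "__main__":
    s = count_secondary_structures(300)
    print(s[:9])
    print(gamma, s[300] / s[299])
```

The ratio of consecutive counts approaches $\gamma$ from below, which is consistent with the sub-exponential factor $n^{3(g-\frac12)} = n^{-3/2}$ for $g = 0$.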



Irreducible g-structures 

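In the context of $\Lambda^*$-candidates we observed that irreducible sub-structures are of key importance. It is accordingly of relevance to understand the combinatorics of these structures. To this end let $\mathbf{D}^*_g(z) = \sum_{n=0}^{\infty} d^*_g(n) z^n$ denote the GF of irreducible $g$-structures.

Lemma 2. For $g \ge 0$, the GF $\mathbf{D}^*_g(z)$ satisfies the recursion

$$\mathbf{D}^*_0(z) = 1 - \frac{1}{\mathbf{D}_0(z)},$$

$$\mathbf{D}^*_g(z) = - \frac{(\mathbf{D}^*_0(z) - 1)\,\mathbf{D}_g(z) + \sum_{g_1=1}^{g-1} \mathbf{D}^*_{g_1}(z)\, \mathbf{D}_{g-g_1}(z)}{\mathbf{D}_0(z)}, \qquad g \ge 1.$$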

For a proof of Lemma 2, see Section Proofs. 
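For genus zero the relation between arbitrary and irreducible structures can be checked with plain power-series arithmetic. The following sketch assumes the convention $\mathbf{D}_0(z) = 1/(1 - \mathbf{D}^*_0(z))$, i.e. every secondary structure is a sequence of irreducible blocks with an isolated vertex counted as a block; this convention and all names below are assumptions made for illustration.

```python
def count_secondary_structures(n_max):
    """d[n] = number of secondary structures over n vertices without 1-arcs."""
    d = [1] * (n_max + 1)
    for n in range(2, n_max + 1):
        d[n] = d[n - 1] + sum(d[k - 1] * d[n - 1 - k] for k in range(1, n - 1))
    return d


def irreducible_from_series(d):
    """Coefficients of 1 - 1/D(z) for a power series D with d[0] == 1."""
    n_max = len(d) - 1
    inv = [0] * (n_max + 1)  # coefficients of 1/D(z)
    inv[0] = 1
    for n in range(1, n_max + 1):
        inv[n] = -sum(d[k] * inv[n - k] for k in range(1, n + 1))
    return [0] + [-x for x in inv[1:]]


if __name__ == "__main__":
    d = count_secondary_structures(12)
    print(d)
    print(irreducible_from_series(d))
```

For $n \ge 3$ the resulting coefficient equals the number of structures over $n - 2$ vertices, which matches the direct description of an irreducible secondary structure as an arc $(1,n)$ enclosing an arbitrary non-empty structure.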



Theorem 2. For $g \ge 1$ we have

(a) the GF of irreducible $g$-structures over $n$ vertices is given by



9 / U^C^) Va(w) 



(5) 



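where $u = \frac{z^2}{(z^2 - z + 1)^2}$, and $U_g(z)$ and $V_g(z)$ are both polynomials with lowest degree at least $2g$, and $U_g(1/4), V_g(1/4) \neq 0$. In particular, for some constant $a^*_g > 0$ and $\gamma \approx 2.618$:

$$d^*_g(n) \sim a^*_g\, n^{3(g - \frac{1}{2})}\, \gamma^n. \tag{6}$$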



(b) the bivariate GF of irreducible $g$-structures over $n$ vertices, containing exactly $m$ arcs, $\mathbf{E}^*_g(z,t)$, is given by



(l_4v)%-2 (l-4v)3s-iy' 

(7) 



where $v = \frac{t z^2}{(t z^2 - z + 1)^2}$.

We shall postpone the proof of Theorem 2 to Section Proofs.

The main result 
Nussinov-like energy model 

In the following we mimic some form of mfe-$g$-structures: inspired by the Nussinov energy model [39] we consider the weight of a $g$-structure over $n$ vertices, $\sigma_{g,n}$, to be given by $w(\sigma_{g,n}) = c\,\ell$, where $c$ is the constant contribution of a single arc and $\ell$ is the number of arcs in $\sigma_{g,n}$ [40]. Then by Assumption 1, we have the weight function $W(\sigma_{g,n}) = (6/16)^{\ell} \eta^{c\ell} = ((6/16)\eta^{c})^{\ell}$. Note that the case $(6/16)\eta^{c} = 1$ corresponds to the uniform distribution, i.e. all $g$-structures have identical weight.

This approach requires us to keep track of the number of arcs, i.e. we need to employ bivariate GFs. In Theorem 1 (b) we computed this bivariate GF and in Theorem 2 (b) we derived from it $\mathbf{E}^*_g(z,t)$, the GF of irreducible $g$-structures over $n$ vertices containing $\ell$ arcs.

The idea now is to substitute for the second indeterminate, $t$, some fixed $\tau = (6/16)\eta^{c} \in \mathbb{R}$. This substitution induces the formal power series

$$\mathbf{D}_{g,\tau}(z) = \mathbf{E}_g(z,\tau),$$

which we regard as being parameterized by $\tau$. Obviously, setting $\tau = 1$ we recover $\mathbf{D}_g(z)$, i.e. we have $\mathbf{D}_g(z) = \mathbf{D}_{g,1}(z) = \mathbf{E}_g(z,1)$. Note that for $\tau > 1/4$, the polynomial $\tau z^2 - z + 1$ has no real root. Thus we have for $\tau > 1/4$ the asymptotics

$$d_{g,\tau}(n) \sim a_{g,\tau}\, n^{3(g - \frac{1}{2})}\, \gamma_\tau^n \quad \text{and} \quad d^*_{g,\tau}(n) \sim a^*_{g,\tau}\, n^{3(g - \frac{1}{2})}\, \gamma_\tau^n, \tag{8}$$

with identical exponential growth rates as long as the supercritical paradigm [42] applies, i.e. as long as the real root of minimal modulus of

$$\frac{\tau z^2}{(\tau z^2 - z + 1)^2} = \frac{1}{4}$$

is smaller than any singularity of $1/(\tau z^2 - z + 1)$. In this situation $\tau$ affects the constant $a_{g,\tau}$ and the exponential growth rate $\gamma_\tau$ but not the sub-exponential factor $n^{3(g - \frac{1}{2})}$. The latter stems from the singular expansion of $\mathbf{C}_g(z)$. Analogously, we derive the $\tau$-parameterized family of GFs $\mathbf{D}^*_{g,\tau}(z) = \mathbf{E}^*_g(z,\tau)$. We set the contribution of a single arc $c = 1$ and the constant $\eta = e$, where $e$ is Euler's number. Then we have the parameter $\tau = (6/16)e \approx 1.0125$. By abuse of notation we will omit the subscript $\tau$, assuming $\tau = (6/16)e$.
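The supercritical condition can be checked numerically for a given parameter. The sketch below assumes the equation $\tau z^2/(\tau z^2 - z + 1)^2 = 1/4$ and uses the fact that, for small positive $z$, it reduces to the quadratic $\tau z^2 - (1 + 2\sqrt{\tau})z + 1 = 0$; the function name `rho` is hypothetical and the comparison value $1/\sqrt{\tau}$ is the modulus of the complex roots of $\tau z^2 - z + 1$.

```python
from math import e, sqrt


def rho(tau):
    """Smallest positive real root of tau*z^2 / (tau*z^2 - z + 1)^2 = 1/4."""
    # Positive branch: 2*sqrt(tau)*z = tau*z^2 - z + 1.
    a, b, c = tau, -(1.0 + 2.0 * sqrt(tau)), 1.0
    disc = b * b - 4.0 * a * c  # equals 1 + 4*sqrt(tau) > 0
    return (-b - sqrt(disc)) / (2.0 * a)


if __name__ == "__main__":
    for tau in (1.0, (6 / 16) * e):
        r = rho(tau)
        # growth rate 1/r versus the modulus 1/sqrt(tau) of the roots of tau*z^2 - z + 1
        print(tau, r, 1 / r, 1 / sqrt(tau))
```

For both $\tau = 1$ and $\tau = (6/16)e$ the root lies well inside the disc of radius $1/\sqrt{\tau}$, so under these assumptions the dominant singularity comes from the inner function, as the supercritical paradigm requires.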

The main result of this section is that the set of $\Lambda^*$-candidates is a small proportion of all entries. To put this size into context we note that the total number of entries considered for the $\Lambda^*$-decomposition rule is given by

$$M(n) = \sum_{m=1}^{n} (n - m + 1).$$

Theorem 3. Suppose an mfe-$g$-structure over an interval of length $m$ is irreducible with probability $d^*_g(m)/d_g(m)$. Then the expected number of candidates of $g$-structures for sequences of length $n$ satisfies

$$\mathbb{E}_g(n) = \Theta(n^2)$$

and furthermore, setting $\overline{\mathbb{E}}_g(n) = \mathbb{E}_g(n)/M(n)$, we have

$$\overline{\mathbb{E}}_g(n) \sim d^*_g(n)/d_g(n) \sim b_g,$$

where $b_g > 0$ is a constant.
We provide an illustration of Theorem 3 in Figure 6. 
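The statement can be explored numerically in the simplest setting. The sketch below is an illustration under assumptions that differ from the paper's parameter choice: genus zero, uniform weights (each structure counted once), and an interval of length $m \ge 3$ taken to be a candidate with probability $s(m-2)/s(m)$, where $s(m)$ counts secondary structures without 1-arcs and $s(m-2)$ counts the irreducible ones (an arc over the whole interval enclosing an arbitrary non-empty structure). Intervals of length below 3 are given probability 0.

```python
def count_secondary_structures(n_max):
    """s[n] = number of secondary structures over n vertices without 1-arcs."""
    s = [1] * (n_max + 1)
    for n in range(2, n_max + 1):
        s[n] = s[n - 1] + sum(s[k - 1] * s[n - 1 - k] for k in range(1, n - 1))
    return s


def candidate_ratio(n):
    """Expected number of candidates divided by M(n) = n(n+1)/2 (uniform genus-0 model)."""
    s = count_secondary_structures(n)
    expected = 0.0
    for m in range(3, n + 1):
        p_m = s[m - 2] / s[m]           # probability that a length-m interval is irreducible
        expected += (n - m + 1) * p_m   # n - m + 1 starting points
    return expected / (n * (n + 1) // 2)


if __name__ == "__main__":
    for n in (50, 100, 200, 400):
        print(n, candidate_ratio(n))
```

In this toy setting the ratio settles to a positive constant rather than decaying to zero, so the expected number of candidates grows quadratically while still being a small fraction of all intervals.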

Proof. We prove the theorem by quantifying the probability of $[i,j]$ being a $\Lambda^*$-candidate. In this case any (not necessarily unique) sub-structure realizing the optimal solution $L_{i,j}$ is $\Lambda^*$-irreducible, and therefore an irreducible structure over $[i,j]$.

Let $m = j - i + 1$. By assumption, the probability that $[i,j]$ is a candidate, conditional on the existence of a sub-structure over $[i,j]$, is given by

$$\mathbb{P}([i,j] \mid [i,j] \text{ is a candidate}) = \frac{d^*_g(m)}{d_g(m)}. \tag{9}$$

Note that $\mathbb{P}([i,j] \mid [i,j] \text{ is a candidate})$ does not depend on the relative location of the interval but only on the interval-length. Let $P_g(m) = d^*_g(m)/d_g(m)$, then according to Theorem 1 and Theorem 2,

$$(1 - \epsilon)\, a_g\, m^{3(g - \frac{1}{2})} \gamma^m \le d_g(m) \le (1 + \epsilon)\, a_g\, m^{3(g - \frac{1}{2})} \gamma^m,$$

$$(1 - \epsilon)\, a^*_g\, m^{3(g - \frac{1}{2})} \gamma^m \le d^*_g(m) \le (1 + \epsilon)\, a^*_g\, m^{3(g - \frac{1}{2})} \gamma^m,$$

for $m \ge m_0$, where $m_0 > 0$ and $0 < \epsilon < 1$ are constants.

Figure 6 The expected number of candidates for secondary and 1-structures from a random input with a simplified arc-based energy model, $\mathbb{E}_0(n)$ and $\mathbb{E}_1(n)$: we compute the expected number of candidates obtained by folding 100 random sequences for secondary structures (A)(solid) and 1-structures (B)(solid). We also display the theoretical expectations implied by Theorem 3 (A)(dashed) and (B)(dashed).
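On the one hand

$$P_g(m) = \frac{d^*_g(m)}{d_g(m)} \le (1 + \epsilon')\, \frac{a^*_g}{a_g} = (1 + \epsilon')\, b_g, \tag{10}$$

where $b_g = a^*_g/a_g > 0$ is a constant and $1 + \epsilon' = (1 + \epsilon)/(1 - \epsilon)$. On the other hand, we have

$$P_g(m) \ge (1 - \epsilon'')\, \frac{a^*_g}{a_g} = (1 - \epsilon'')\, b_g, \tag{11}$$

where $1 - \epsilon'' = (1 - \epsilon)/(1 + \epsilon)$. Setting $\epsilon = \max\{\epsilon', \epsilon''\}$, we can conclude that $P_g(m) = d^*_g(m)/d_g(m) \sim b_g$, see Figure 7.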

We next study the expected number of candidates over an interval of length $m$. To this end let

$$X_m = |\{[i,j] \mid [i,j] \text{ is a } \Lambda^*\text{-candidate of length } m\}|.$$

The expected cardinality of the set of $\Lambda^*$-candidates of length $m = j - i + 1$ encountered in the DP-algorithm is given by

$$\mathbb{E}_g(X_m) \le (n - (m - 1))\, P_g(m),$$

since there are $n - (m - 1)$ starting points for such an interval $[i,j]$. For sufficiently large $m \ge m_0$ we have $P_g(m) \le (1 + \epsilon) b_g$ with $\epsilon$ being a small constant. Therefore, by linearity of expectation,

$$\mathbb{E}_g(n) = \mathbb{E}_g\Big(\sum_m X_m\Big) \le \sum_{m=1}^{m_0 - 1} (n - m + 1)\, P_g(m) + (1 + \epsilon)\, b_g \sum_{m=m_0}^{n} (n - m + 1). \tag{12}$$

Consequently, the expected size of the $\Lambda^*$-candidate set is $O(n^2)$. We proceed by comparing the expected number of candidates of a sequence of length $n$ with $M(n)$:

$$\frac{\mathbb{E}_g(n)}{M(n)} \le (1 + \epsilon)\, b_g + \frac{\sum_{m=1}^{m_0 - 1} (P_g(m) - (1 + \epsilon) b_g)(n - m + 1)}{M(n)} \le (1 + \epsilon)\, b_g + \frac{k\, n}{M(n)},$$

where $k$ is a constant. For sufficiently large $n \ge n_0$, $\mathbb{E}_g(n)/M(n) \le (1 + \epsilon') b_g$. Furthermore

$$\frac{\mathbb{E}_g(n)}{M(n)} = \frac{\mathbb{E}_g(n)}{\sum_{m=1}^{n} (n - m + 1)} \ge (1 - \epsilon)\, b_g,$$

from which we can conclude $\mathbb{E}_g(n)/M(n) \sim d^*_g(n)/d_g(n) \sim b_g$ and the theorem is proved. □

Loop-based energy model 

In this section we discuss the loop-based energy model of RNA secondary structure folding. To be precise, we evoke here trivariate GFs $\mathbf{F}(z,t,v)$ and $\mathbf{F}^*(z,t,v)$ whose coefficients count the numbers of secondary structures and irreducible secondary structures over $n$ vertices having $\ell$ arcs and energy $y$, respectively. This becomes necessary since the loop-based model distinguishes between arcs and energy. The "cancelation" effect or reparameterization of stickiness [40] to which we referred before does not appear in this context. Thus we need both an arc- as well as an energy-filtration.

Figure 7 The probability distribution of $P_0(m)$ (A) and $P_1(m)$ (B) on a simplified arc-based energy model.

A further complication emerges. In contrast to the GFs $\mathbf{E}_g(z,t)$ and $\mathbf{E}^*_g(z,t)$, the new GFs are not simply obtained by formally substituting $t z^2/(t z^2 - z + 1)^2$ into the power series $\mathbf{D}_g(z)$ and $\mathbf{D}^*_g(z)$ as bivariate terms. The more complicated energy model requires a specific recursion for irreducible secondary structures.

The energy model used in the prediction of secondary structures is more complicated than the simple arc-based energy model. Loops, which are formed by arcs as well as the isolated vertices between these arcs, are considered to contribute energy. Loops are categorized as hairpin loops (no nested arcs), interior loops (including bulge loops and stacks) and multi-loops (at least two nested arcs), see Figure 8. An arbitrary secondary structure can be uniquely decomposed into a collection of mutually disjoint loops. A result of the particular energy parameters [8] is that the energy model prefers interior loops, in particular stacks (no isolated vertex between two parallel arcs), and disfavors multi-loops. Based on this observation, we give a simplified energy model for a loop $\lambda$ contained in a secondary structure, which depends only on the loop type:

• $w(\lambda) = 0.5$ if $\lambda$ is a hairpin loop,

• $w(\lambda) = 1$ if $\lambda$ is an interior loop,

• $w(\lambda) = -5$ if $\lambda$ is a multi-loop,

where $\lambda$ is a loop in a structure. The energy of a secondary structure $\sigma$ is accordingly given by

$$w(\sigma) = \sum_{\lambda} w(\lambda). \tag{13}$$
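The simplified model above can be evaluated directly from the arc set of a structure. The following Python sketch is a minimal illustration, not the authors' implementation: it assumes a non-crossing structure given as a list of arcs $(i,j)$ with $i<j$, classifies the loop closed by each arc by the number of arcs nested directly below it, and assigns no energy to the exterior loop (an assumption made here). The function name `loop_energy` is hypothetical.

```python
def loop_energy(arcs):
    """Return w(sigma) for a non-crossing arc set under the simplified loop model."""
    weights = {"hairpin": 0.5, "interior": 1.0, "multi": -5.0}
    total = 0.0
    for (i, j) in arcs:
        # Arcs nested strictly inside (i, j).
        inner = [(a, b) for (a, b) in arcs if i < a and b < j]
        # Keep only arcs directly below (i, j), i.e. not covered by another inner arc.
        children = [(a, b) for (a, b) in inner
                    if not any(c < a and b < d for (c, d) in inner)]
        if len(children) == 0:
            total += weights["hairpin"]
        elif len(children) == 1:
            total += weights["interior"]
        else:
            total += weights["multi"]
    return total

print(loop_energy([(1, 10), (2, 9)]))            # stack closed by a hairpin: 1.5
print(loop_energy([(1, 20), (2, 8), (10, 18)]))  # multi-loop with two hairpins: -4.0
```

The second structure scores far lower than the first, which reflects how strongly the simplified model disfavors multi-loops.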

Let $F^*_0(z)$ and $F_0(z)$ be the energy-filtered GFs obtained by setting $t = 6/16$ and $v = \eta = e$ in $F^*(z,t,v)$ and $F(z,t,v)$, where $e$ is Euler's number. Then

$$F_0(z) = \sum_{\sigma} \left(\tfrac{6}{16}\right)^{\ell} e^{w(\sigma)} z^{n}, \qquad F^*_0(z) = \sum_{\sigma'} \left(\tfrac{6}{16}\right)^{\ell'} e^{w(\sigma')} z^{n},$$

where $\sigma$ is an arbitrary and $\sigma'$ is an irreducible secondary structure over $n$ vertices. Along these lines, $\ell$ and $\ell'$ denote the number of arcs in $\sigma$ and $\sigma'$. In other words, what happens here is that we find a suitable parameterization which brings us back to a simple univariate GF whose coefficients count the sum of weights of structures over $n$ vertices.

Lemma 3. The energy-filtered generating function of irreducible RNA secondary structures, $F^*_0(z)$, satisfies the recursion



,0.5 2 



6 1 2 

H e^z^ 



^16 



1 - z 16 



(14) 



1- 



and $F^*_0(z)$ is uniquely determined by the above equation. Furthermore

$$F_0(z) = \frac{1}{1-z}\cdot\frac{1}{1-\frac{F^*_0(z)}{1-z}}. \tag{15}$$

Proof. We first consider the GF $F^*_0(z)$, whose coefficient of $z^n$ denotes the total weight of irreducible secondary structures over $n$ vertices, where $(1,n)$ is an arc. This arc gives rise to the term $\frac{6}{16}z^2$. Isolated vertices lead to the term

$$\sum_{i=0}^{\infty} z^{p+i} = \frac{z^{p}}{1-z},$$

where $p$ denotes the minimum number of isolated vertices to be inserted. Depending on the type of loop formed by $(1,n)$, we have

• hairpin loops: ^3^, 

• interior loops: Fq(z) ^Y^i) > 

• multi-loops: there are at least two irreducible 
sub-structures, as well as isolated vertices, thus 

V 2 



1 



/=2 




Figure 8 Diagram representation of loop types in secondary structures: (A) hairpin loop, (B) interior loop, (C) multi-loop.

Considering the contributions from the energy model we compute



/ 



16 



1 — z 



1 



i-n(z)^,i-z 



which establishes the recursion. The uniqueness of the solution as a power series follows from the fact that each coefficient can evidently be computed recursively.

An arbitrary secondary structure can be considered as a sequence of irreducible sub-structures together with intervals of isolated vertices. Thus

$$F_0(z) = \frac{1}{1-z}\sum_{i=0}^{\infty}\left(\frac{F^*_0(z)}{1-z}\right)^{i} = \frac{1}{1-z}\cdot\frac{1}{1-\frac{F^*_0(z)}{1-z}}.$$

□

Lemma 4. $F^*_0(z)$ and $F_0(z)$ have the same singular expansion:

$$f^*_0(n) \sim \alpha\, n^{-3/2}\gamma^{n} \quad\text{and}\quad f_0(n) \sim \beta\, n^{-3/2}\gamma^{n}, \tag{16}$$

where $\alpha \approx 0.24$ and $\beta \approx 2.88$ are constants and $\gamma \approx 2.1673$.

Proof. Solving eq. (14) we obtain a unique solution for $F^*_0(z)$ whose coefficients are all positive. We observe that the dominant singularity of $F^*_0(z)$ is $\rho \approx 0.4614$. $F_0(z)$ is a function of $F^*_0(z)$, and the real root of minimal modulus of $1 - F^*_0(z)/(1-z) = 0$ is bigger than $\rho$. Then, by the supercritical paradigm [42], $F_0(z)$ and $F^*_0(z)$ have identical exponential growth rates. Furthermore, $F^*_0(z)$ and $F_0(z)$ have the same sub-exponential factor $n^{-3/2}$, hence the lemma. □

Theorem 4. Suppose an mfe-secondary structure over an interval of length $m$ is irreducible with probability $P_0(m) = f^*_0(m)/f_0(m)$. Then the expected number of candidates from a random sequence of length $n$ with a simplified loop-based energy model is

$$E_0(n) = \sum_{m=1}^{n} P_0(m)\,(n-m+1),$$

and furthermore, setting $\mathbb{E}_0(n) = E_0(n)/M(n)$, we have

$$\mathbb{E}_0(n) \sim f^*_0(n)/f_0(n) \sim b,$$

where $b = \alpha/\beta \approx 0.08$.

Proof. By Lemma 4 we have $f^*_0(m)/f_0(m) \sim b$, where $b$ is a constant. The proof is completely analogous to that of Theorem 3. □

We show the distribution of $P_0(m)$ and $\mathbb{E}_0(n)$ in Figure 9.
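As a quick sanity check of the constant in Theorem 4, the leading asymptotic terms of eq. (16) can be plugged in directly. The Python sketch below is illustrative only: it uses the rounded constants quoted in Lemma 4, so the resulting percentages are approximate, and the helper name `asymptotic_ratio` is hypothetical.

```python
ALPHA, BETA, GAMMA = 0.24, 2.88, 2.1673  # rounded constants quoted in Lemma 4

def asymptotic_ratio(n):
    """f*_0(n)/f_0(n) evaluated from the leading asymptotic terms of eq. (16)."""
    irreducible = ALPHA * n ** -1.5 * GAMMA ** n
    arbitrary = BETA * n ** -1.5 * GAMMA ** n
    return irreducible / arbitrary

b = ALPHA / BETA
for n in (50, 200, 800):
    # The exponential and sub-exponential factors cancel, leaving alpha/beta.
    print(n, round(asymptotic_ratio(n), 4))
print("b =", round(b, 4), "reduction =", round(100 * (1 - b), 1), "%")
```

With these rounded values $b \approx 0.083$, i.e. a reduction of roughly 91 to 92 percent, in line with the theoretical figure quoted in the abstract.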

Conclusion 

In this paper we quantify the effect of sparsification of the rule A*. This rule splits intervals and separates concatenated sub-structures. The sparsification of A* alone is claimed to provide a speed-up of up to a linear factor for the DP-folding of RNA secondary structures [29]. A similar conclusion is drawn in [30], where the sparsification of RNA-RNA interaction structures is also shown to yield a linear reduction in time complexity. Both papers [29,30] base their conclusion on the validity of the polymer-zeta property. However, [34] comes to a different conclusion, reporting a mere constant reduction in time complexity. While A* is the key for the time complexity reduction of secondary structure folding, it is conceivable that for pseudoknot structures there may exist non-sparsifiable rules, in which case the overall time complexity is not reduced.

Figure 9 The distribution of $P_0(m)$ (A) and $\mathbb{E}_0(n)$ obtained by folding 100 random sequences on the loop-based model (B)(solid), as well as the theoretical expectation implied by Theorem 4 (B)(dashed).

In any case, the key is the set of candidates, and we provide an analysis of A*-candidates by combinatorial means. In general, the connection between candidates, i.e. unions of disjoint intervals, and the combinatorics of structures is actually established by the algorithm itself via backtracking: at the end of the DP-algorithm a structure is generated that realizes the previously computed energy as mfe-structure. This connects intervals and sub-structures.

So, does the condition $c > 1$ in polymer-zeta apply in the context of RNA structures? In fact, this condition would follow if the intervals in question were distributed as in uniformly sampled structures. This, however, is far from reasonable, due to the fact that the mfe-algorithm deliberately designs some mfe-structure over the given interval. What the algorithm produces is in fact antagonistic to uniform sampling. We here wish to acknowledge the help of one anonymous referee in clarifying this point.

Our results imply that polymer-zeta does not hold. 
Our framework critically depends on a specific distri- 
bution of mfe structures within irreducible and arbi- 
trary structures, explicated in Assumption 1. We have 
cross-checked Assumption 1 with the number of can- 
didates in DP-programs (using the same energy model), 
see Figure 7 and Figure 9. With this conclusion we are 
in accord with [31,34] but provide an entirely different 
approach. 

The non-validity of polymer-zeta has also been observed in the context of the limit distribution of the 5'-3' distances of RNA secondary structures [43]. Here it is observed that long arcs, to be precise arcs of length $O(n)$, always exist. This is of course a contradiction to the polymer-zeta property in case of $c > 1$.

The key to quantifying the expected number of candidates is the singularity analysis of a pair of energy-filtered GFs, namely that of a class of structures and that of the subclass of all such structures that are irreducible. We show that for various energy models the singular expansions of both these functions are essentially equal, modulo some constant. This implies that the expected number of candidates is $\Theta(n^2)$, and all constants can be computed explicitly from a detailed singularity analysis. The good news is that, depending on the energy model, a significant constant reduction of around 96% can be obtained. This is in accordance with data produced in [31] for the mfe-folding of random sequences. There a reduction by 98% is reported for sequences of length > 500.
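The difference between a constant reduction and a linear-factor reduction can be made concrete with a few numbers. The Python sketch below is a toy calculation under stated assumptions: it takes $M(n) = n(n+1)/2$ as the number of intervals and a fixed candidate fraction of 0.04, corresponding to the reduction of roughly 96% mentioned above. Both function names are hypothetical.

```python
def total_intervals(n):
    """M(n): number of intervals [i, j] with 1 <= i <= j <= n (assumed definition)."""
    return n * (n + 1) // 2

def expected_candidates(n, fraction):
    """Expected candidate count if a constant fraction of all intervals are candidates."""
    return fraction * total_intervals(n)

fraction = 0.04  # assumed: corresponds to a reduction of roughly 96%
for n in (250, 500, 1000):
    c = expected_candidates(n, fraction)
    # c / n keeps growing with n, so the candidate count is not linear in n.
    print(n, c, c / n)
```

Doubling the sequence length roughly quadruples the candidate count, which is the signature of $\Theta(n^2)$ growth with a small constant rather than of a linear-factor improvement.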

Our findings are of relevance for numerous results that are formulated in terms of sizes of candidate sets [32]. These can now be quantified. It is certainly of interest to devise a full-fledged analysis of the loop-based energy model. While these computations are far from easy, our framework shows how to perform such an analysis.

Using the paradigm of gap-matrices, Backofen has shown [32] that the sparsification of the DP-folding of RNA pseudoknot structures exhibits additional instances where sparsification can be applied, see Figure 5B. Our results show that the expected number of candidates is $\Theta(n^2)$, where the constant reduction is around 90%. This is in fact very good news, since the sequence length in the context of RNA pseudoknot structure folding is in the order of hundreds of nucleotides. So sparsification of further instances does have a significant impact on the time complexity of the folding.

Proofs 

In this section, we prove Lemma 2 and Theorem 2.

Proof for Lemma 2: let $D(z,u)$ and $D^*(z,u)$ be the bivariate GFs $D(z,u) = \sum_{n\ge 0}\sum_{g\ge 0} d_g(n) z^n u^g$ and $D^*(z,u) = \sum_{n\ge 1}\sum_{g\ge 0} d^*_g(n) z^n u^g$. Suppose a structure contains exactly $j$ irreducible structures, then

$$D(z,u) = \sum_{j\ge 0} D^*(z,u)^{j} = \frac{1}{1 - D^*(z,u)} \tag{17}$$

and

$$D^*_g(z) = -[u^g]\,\frac{1}{D(z,u)}, \quad g \ge 1, \tag{18}$$

as well as $D^*_0(z) = 1 - [u^0]\,\frac{1}{D(z,u)}$. Let $\mathbb{F}(z,u) = \frac{1}{D(z,u)}$. Then $\mathbb{F}(z,u)\,D(z,u) = 1$, whence for $g \ge 1$,

$$\sum_{g_1=0}^{g} \mathbb{F}_{g_1}(z)\, D_{g-g_1}(z) = [u^g]\,\mathbb{F}(z,u)\,D(z,u) = 0, \tag{19}$$

and $\mathbb{F}_0(z)\,D_0(z) = 1$, where $\mathbb{F}_g(z) = [u^g]\,\mathbb{F}(z,u)$. Furthermore, we have $\mathbb{F}_0(z) = \frac{1}{D_0(z)}$ and

$$\mathbb{F}_g(z) = -\frac{\sum_{g_1=0}^{g-1} \mathbb{F}_{g_1}(z)\, D_{g-g_1}(z)}{D_0(z)}, \quad g \ge 1, \tag{20}$$

which implies $D^*_0(z) = 1 - \mathbb{F}_0(z) = 1 - \frac{1}{D_0(z)}$ and

and setting h = gi — j we obtain, 



$$D^*_g(z) = -\mathbb{F}_g(z) = \frac{\sum_{g_1=0}^{g-1} \mathbb{F}_{g_1}(z)\, D_{g-g_1}(z)}{D_0(z)}. \tag{21}$$
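The genus 0 case of this relation, $D^*_0(z) = 1 - 1/D_0(z)$, can be verified coefficient by coefficient. The Python sketch below is a minimal check, not part of the original proof. It assumes secondary structures with minimum arc length 2, which is the class counted by the closed form of $D_0(z)$ used in this section, and the function names are hypothetical. Note that in this sequence construction a single isolated vertex counts as an irreducible block, which is why the coefficient of $z$ in $D^*_0(z)$ equals 1.

```python
def secondary_structure_counts(n_max):
    """d[n] = number of secondary structures over n vertices without arcs of length 1."""
    d = [0] * (n_max + 1)
    d[0] = 1
    for n in range(1, n_max + 1):
        # Vertex 1 is isolated, or it is paired with some vertex j >= 3.
        d[n] = d[n - 1] + sum(d[j - 2] * d[n - j] for j in range(3, n + 1))
    return d

def reciprocal_series(d):
    """Coefficients of 1/D(z) for a power series D(z) with d[0] == 1."""
    f = [0] * len(d)
    f[0] = 1
    for n in range(1, len(d)):
        f[n] = -sum(d[k] * f[n - k] for k in range(1, n + 1))
    return f

d = secondary_structure_counts(10)
f = reciprocal_series(d)
d_star = [0] + [-f[n] for n in range(1, len(f))]   # D*_0(z) = 1 - 1/D_0(z)
print(d)       # [1, 1, 1, 2, 4, 8, 17, 37, 82, 185, 423]
print(d_star)  # [0, 1, 0, 1, 1, 2, 4, 8, 17, 37, 82]
```

For $n \ge 3$ the coefficient of $z^n$ in $D^*_0(z)$ equals the number of structures over $n-2$ vertices, i.e. the structures in which $(1,n)$ is an arc, as expected for irreducible secondary structures.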



Proof for Theorem 2. Let $[n]_k$ denote the set of compositions of $n$ having $k$ parts, i.e. for $\sigma \in [n]_k$ we have $\sigma = (\sigma_1, \dots, \sigma_k)$ and $\sum_{i=1}^{k} \sigma_i = n$.

Claim.



(^+1-7 ^ 



(22) 



^1=1 ;=o 



= EE5^( E nD.(-))D,.i-,,(.) 



h=2 ^ \or€[^lk / 



J_ (-1)^+2 



/z+1 



= E5^ E n^*) 

h=2 ^^^^ Va^€[^+1],, 1 



and setting J = g — h 



We shall prove the claim by induction on $g$. For $g = 1$ we have



(23) 



whence eq. (22) holds for $g = 1$. By induction hypothesis, we may now assume that for $j \le g$, eq. (22) holds. According to Lemma 2, we have



D|+i(^) = 



(D*(z) - l)Dg+i(z) +Tfg,=i D|/z)Dg+i-gi(z) 
Doiz) 



= E^^ E 'ff''.;<4 

)=0 ^ \<T'€|^+l]^+i_y /=1 / 

Consequently, the Claim holds for any $g \ge 1$. For any $g \ge 1$, we have [11]

$$D_g(z) = \frac{1}{z^2 - z + 1}\,\frac{P_g(u)}{(1-4u)^{3g-1/2}}, \qquad D_0(z) = \frac{1}{z^2 - z + 1}\,\frac{2}{1+\sqrt{1-4u}},$$



Dg+i(z) ^ / Dg,(z) 
Do(z)2 I Do(z)- 



3 



J=0 



Do(z)^i+2-/ 



We next observe 



where $P_g(u)$ is a polynomial with integral coefficients of degree at most $(3g-1)$, $P_g(1/4) \neq 0$, $[u^{2g}]P_g(u) \neq 0$ and $[u^h]P_g(u) = 0$ for $0 \le h \le 2g-1$. Let $u = \frac{z^2}{(z^2-z+1)^2}$. The Claim provides in this context the following interpretation of $D^*_g(z)$



1 Vg{u) /i + VT^^ 

t>^(^) = 



(1 - 4m)3^-V2 \ 2 



V ^^^^^^ n r.^ 



^-2 

+E 

;-0 



1 + Vl -4m 



(_l)<?+2-C?-l) 



^+i-(^-i) \ 
Do(,).+2-„-i) . E n D.;(z) , 

(24) 



(1 - 4m)3«-¥ 



(25) 
and



j=0 k=0 



■i±k 



= 2: E - 



As $0 \le j \le g-2$ and $g-j \le s \le 2g+1-2j$, we have $s \ge 2$. Consequently we arrive at

-l>(r(Z) = ■ .o, I/O + 



z^-z+l ' (1 - 4m)%-V2 (1 _ 4m)35-i ' 

(26) 



where 



^-2 



/=0 ^-;<5<2^+l-2; 
s is odd 



5-1 

X (1 - 4m) 2 , 



and 



^-3 



;=0 g-j<s<2g+l-2j 
s is even 



5-2 

X (1 - 4m) 2 . 
We have for a A" > 1 

["''ME np-.(«>) = E nt^'^ip-'^")' 

where X!f=i ^< = ^' — 0- Then we obtain that 



E n^'^'^") =0' 0 < ^ < 2^ - 1- (27) 



Since [ uJ"'] (m) = 0, hi < 2CT( - 1, [ m2<^'] P^,. (m) ^ 0 and 
ELi = S- Thus for 0 < /z < 2^ - 1, 

[ u''] \Jg(u) = 0 and [ u''] Vgiu) = 0. (28) 

As shown in [11] we have

$$P_g(1/4) = \frac{\Gamma(g-1/6)\,\Gamma(g+1/2)\,\Gamma(g+1/6)\,9^{g}\,4^{-g}}{6\pi^{3/2}\,\Gamma(g+1)} \tag{29}$$



and we obtain Uj(l/4) — P^(l/4)/4. Furthermore, 
V,(l/4) = ?^ + (-l)V^nP,(l/4) 



^-1 



= - |^4P^(l/4) - ^P;(l/4)P^_;(l/4)j ^ 0. 

We can draw on the computation of [11] in order to observe $4P_g(1/4) - \sum_{j=1}^{g-1} P_j(1/4)\,P_{g-j}(1/4) \neq 0$. In order to compute the bivariate GF, $E^*_g(z,t)$, we only need to replace in eq. (22) $D_g(z)$ by $E_g(z,t)$ and the proof is completely analogous.